1 Setup

This section loads the libraries required for the analysis and sets initial parameters such as the color mapping and the random seed. Loading the libraries up front makes all required functions available for the data processing steps that follow, and fixing the seed ensures that any random operations produce identical results on every run, making the analysis reproducible.
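The setup chunk itself is not shown; a minimal sketch of what it typically contains is given below. The library list, seed value, and color values here are illustrative assumptions, not the report's actual settings (for example, a `fraction_colors` mapping must be defined somewhere in setup, since the plots below use it).

```r
# Illustrative setup sketch -- library list, seed, and colors are assumptions
suppressPackageStartupMessages({
  library(data.table)  # fread(), :=, melt() used below
  library(dplyr)       # pipe-based summaries
  library(ggplot2)     # plotting
})

set.seed(42)  # any fixed seed makes random operations reproducible

# Example color mapping for buffer types, as used by scale_fill_manual() later
fraction_colors <- c(single = "#999999", AS = "#E69F00",
                     AI = "#56B4E9", Total = "#0072B2")
```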

2 Data Loading

In this section, the raw data files are loaded into R: the main dataset and the metadata file used in the subsequent analysis steps. The final output directory, which will store the results, is also defined here; if it does not exist, it is created.

# Adjust path as necessary for your file locations
dataset_unsorted <- fread("data/report.pg_matrix.tsv")[, ID := .I]
meta_data <- fread("data/SampleDescription.csv")

# Define the final output directory, or set NULL if results should not be saved
final_out_dir <- "results"
# Creates the folder if it doesn't exist
if(!is.null(final_out_dir)){dir.create(final_out_dir, recursive = TRUE, showWarnings = FALSE)}

3 Data Preparation

3.1 Data Trimming

The column names of the loaded dataset are cleaned up in this step: a machine-specific path prefix and a slot suffix are removed to standardize the names, making them easier to work with. Standardized column names help avoid confusion and errors in subsequent processing. In addition, two Method labels in the metadata are renamed to their more descriptive forms.

# Clean up column names in dataset
# Remove specific prefix and suffix from column names
prefix <- "/mnt/ag_proteomik/shared_worktmp/20240126_CaGe_Knochenextraktion ohne Label_Thess/Knochenextrakte_"
suffix <- "_Slot.*"

names(dataset_unsorted) <- sub(suffix, "", sub(prefix, "", names(dataset_unsorted)))

# Perform the replacements
meta_data[Method == "2step+", Method := "2step+M"]
meta_data[Method == "1step+", Method := "1step+C"]

3.2 Data Sorting

In this section, the metadata is prepared for sorting, and the main dataset’s columns are reordered based on this sorted metadata. This ensures that the dataset columns follow a meaningful order, aligning with the metadata. Sorting the columns in a consistent order helps in maintaining data integrity and makes further analysis more straightforward.

# Prepare meta_data for sorting
# Extract numerical part from 'Column' for sorting purposes
meta_data[, M_value := as.numeric(sub("M", "", sub("-.*", "", Column)))]

# Define custom order for Buffer_name and apply it
buffer_order <- c("AS", "AI", "AS1", "AI1", "AI2", "AS2", "single")
meta_data[, Buffer_name := factor(Buffer_name, levels = buffer_order)]

# Sort meta_data based on multiple criteria
ordered_meta_data <- meta_data[order(M_value, Repl_Techn, Repl_Biol, Buffer_name)]

# Sort column names in dataset_unsorted
descr_columns <- names(dataset_unsorted)[!names(dataset_unsorted) %in% ordered_meta_data$Column]
new_order <- c(descr_columns, ordered_meta_data$Column)
dataset_sorted <- dataset_unsorted[, ..new_order]

This completes the setup and initial data preparation steps, ensuring that the data is properly loaded, cleaned, and organized for further analysis.

4 Normalization of LfQ data

This normalization technique adjusts the values in each column based on the average intensity observed across similar types of samples (as grouped by Sample_Type). By scaling each column to match the average, the method attempts to correct for any discrepancies in sample loading or measurement that might have resulted in abnormally high or low readings. The goal is to bring all samples to a common scale, making it easier to compare results across different samples or conditions without the confounding effect of varying starting intensities.
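As a toy illustration of the median-scaling idea (the numbers are invented; the actual implementation follows in the next subsections): each sample's intensity sum is divided into the median sum of its group, giving a per-sample factor that rescales every value in that sample's column.

```r
# Toy example: three samples of one Sample_Type with different total intensity
sums <- c(S1 = 80, S2 = 100, S3 = 120)  # per-sample intensity sums
norm_fact <- median(sums) / sums        # factor scaling each sum to the median
round(norm_fact, 2)
#>   S1   S2   S3 
#> 1.25 1.00 0.83 
# Multiplying each sample's column by its factor equalizes the totals:
stopifnot(all.equal(unname(sums * norm_fact), rep(100, 3)))
```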

4.1 Sample Loading Functions

First, we define a function to calculate the intensity sums. This function filters out rows that have any missing values (NAs) in the columns matching the pattern “^M[1-5]” and then calculates the sum of each of these columns. This helps ensure that only complete cases are considered in the sum calculations.

# Define the function to calculate intensity sums
calculate_intensity_sums <- function(df) {
  # Apply filter and selection operations
  filtered_df <- df %>%
    # Filter out rows that have any NAs in columns matching "^M[1-5]"
    filter(rowSums(is.na(select(., matches("^M[1-5]")))) == 0)
  
  # Print the number of rows after filtering
  message("Number of rows after filtering: ", nrow(filtered_df))
  
  # Select columns that match the pattern "^M[1-5]" 
  # and calculate the sum of each selected column
  intensity_sums <- filtered_df %>%
    select(matches("^M[1-5]")) %>%
    colSums()
  
  # Return the resulting sums
  return(intensity_sums)
}

Next, we define a function to calculate normalization factors. This function first calculates the column sums using the previously defined function. It then computes the normalization factors for each unique sample type by dividing the median sum of each sample type by the respective intensity sum.

calculate_norm_fact <- function(df, sampledescription) {
  # Calculate column sums after filtering NA values
  intensity_sums <- calculate_intensity_sums(df)

  # Create a mapping from column names to sample types
  col_sample_type <- sampledescription$Sample_Type[match(names(intensity_sums), sampledescription$Column)]
  
  # Compute normalization factors for each unique sample type
  norm_fact <- sapply(unique(col_sample_type), function(st) {
    # Index intensity_sums via col_sample_type so the factors stay aligned
    # with the intensity columns even if sampledescription is ordered differently
    median_sum <- median(intensity_sums[col_sample_type == st])
    median_sum / intensity_sums[col_sample_type == st]
  })

  # Unlist normalization factors and adjust names
  norm_fact <- setNames(
    unlist(norm_fact), 
    sub(".*\\.", "", names(unlist(norm_fact)))
  )

  # Return the calculated normalization factors
  return(norm_fact)
}

Finally, we define a function to apply the normalization factors to the dataframe. This function ensures that the normalization factors’ names directly match the dataframe’s column names and then multiplies each column by its corresponding normalization factor.

normalize_df <- function(df, norm_fact) {
  # Ensure the normalization factors' names directly match the df's column names
  norm_cols <- names(norm_fact)
  
  # Loop through each column name that needs normalization
  for (col_name in norm_cols) {
    if (col_name %in% names(df)) {
      # Multiply the column by its corresponding normalization factor
      df[[col_name]] <- df[[col_name]] * norm_fact[col_name]
    }
  }
  
  return(df)
}

4.2 Normalize

Putting it all together, we can now normalize our dataset. First, we calculate the normalization factors, and then we apply these factors to our dataset.

norm_fact <- calculate_norm_fact(dataset_sorted, ordered_meta_data)
## Number of rows after filtering: 132
dataset_norm <- normalize_df(dataset_sorted, norm_fact)

write.csv(dataset_norm,  file.path(final_out_dir, "normalized_dataset.csv"), row.names = FALSE)

This completes the normalization process, ensuring that all samples are brought to a common scale for more accurate comparison of results across different samples or conditions. A total of 132 rows were used, as only those had measured intensities in all samples.

4.3 Visualization

# Melt the datasets for ggplot2
dataset_sorted_melt <- melt(dataset_sorted, id.vars = c("Protein.Group", "Protein.Ids", "Protein.Names", "Genes", "First.Protein.Description", "ID"))
dataset_norm_melt <- melt(dataset_norm, id.vars = c("Protein.Group", "Protein.Ids", "Protein.Names", "Genes", "First.Protein.Description", "ID"))

# Add a column to distinguish between normalized and unnormalized
dataset_sorted_melt$Type <- "Unnormalized"
dataset_norm_melt$Type <- "Normalized"

# Combine the datasets
combined_data <- rbind(dataset_sorted_melt, dataset_norm_melt)

# Plot the boxplots
p <- ggplot(combined_data, aes(x = variable, y = log10(value), fill = Type)) +
  geom_boxplot(outlier.shape = NA) + # Remove outliers for clarity
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Boxplot of Intensities: Normalized vs Unnormalized", x = "Sample", y = "log10 Intensity")

ggsave(file.path(final_out_dir, "Intensities_pre_and_post_normalization.png"), plot = p)
## Saving 7 x 5 in image
## Warning: Removed 336184 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
print(p)
## Warning: Removed 336184 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

5 Method comparisons

5.1 Data Transformation and Merging

# Melt dataset_norm into a long format
dataset_long <- data.table::melt(dataset_norm, 
                                 id.vars = descr_columns, 
                                 variable.name = "Sample.ID", 
                                 value.name = "maxLfQ_intensities",
                                 variable.factor = FALSE)

# Merge dataset_long with meta_data
dataset_merged <- merge(dataset_long, meta_data, by.x = "Sample.ID", by.y = "Column", all.x = TRUE)

write_tsv(dataset_merged, file.path(final_out_dir, "report.pg_matrix_long.tsv"))

5.2 Filtering Rows Based on Replicate Consistency

In this step, we filter the rows of the dataset to retain only those entries that are present in at least 3 out of 4 biological replicates for each combination of method and buffer type. This ensures that our analysis focuses on consistently detected proteins across multiple biological replicates, enhancing the reliability of the results.

# Filter rows where technical replicate is 1 and maxLfQ_intensities is not NA
dataset_filtered <- dataset_merged[Repl_Techn == 1 & !is.na(maxLfQ_intensities)]

# Count unique biological replicates for each combination of method, buffer type, and ID
dataset_filtered_repl <- dataset_filtered[, .(Unique_Repl_Biol_Count = uniqueN(Repl_Biol)), by = .(Method, Buffer_name, Buffer_type, ID)]

# Filter rows where the count of unique biological replicates is at least 3
dataset_filtered_in_3_of_4_repl <- dataset_filtered_repl[Unique_Repl_Biol_Count >= 3]

dataset_filtered_in_3_of_4 <- merge(
  dataset_filtered, 
  dataset_filtered_in_3_of_4_repl[, .(Method, Buffer_name, ID)], 
  by = c("Method", "Buffer_name", "ID")
)

5.3 Statistics

5.3.1 Calculate

First, we calculate summary statistics for each sample type: the mean LFQ intensity, its standard deviation, and the coefficient of variation (CV, in %). Additionally, we extract the intensities for each biological replicate.

# Calculate statistics for each sample type
stats_df <- dataset_filtered_in_3_of_4 %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name) %>%
  summarize(
    #MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
    MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
    SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
    CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
    .groups = 'drop'
  )

# Prepend "M" to the existing M_value in meta_data
meta_data <- meta_data %>% mutate(M = paste0("M", M_value))

# Calculate the intensities for each biological replicate
replicate_intensities <- dataset_filtered %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, M_value, Buffer_name, Repl_Biol) %>%
  summarize(Intensity = mean(maxLfQ_intensities, na.rm = TRUE), .groups = 'drop') %>%
  mutate(M = paste0("M", M_value)) %>%
  mutate(B = paste0("B", Repl_Biol)) %>%
  unite("M_B_Repl", M, Buffer_name, B, sep = "-") %>%
  select(-Repl_Biol, -M_value) %>%
  pivot_wider(names_from = M_B_Repl, values_from = Intensity)

# Create a mapping for Method to the corresponding M value
method_mapping <- unique(meta_data[, .(Method, M)])

# Add a new column M-B to stats_df using the mapping
stats_df_long <- stats_df %>%
  group_by(ID) %>%
  summarize(Occurrence = paste(unique(Method), collapse = ";")) %>%
  right_join(stats_df, by = "ID") %>%
  left_join(method_mapping, by = "Method") %>%
  mutate(M_B = paste0(M, "-", Buffer_name)) %>%
  select(-Method, -Buffer_name) %>%
  pivot_longer(cols = c(MeanLFQ, SDLFQ, CVLFQ), names_to = "Metric", values_to = "Value") %>%
  unite("M_B_Metric", M_B, Metric, sep = "_") %>%
  select(-M) %>%
  pivot_wider(names_from = M_B_Metric, values_from = Value) %>%
  relocate(Occurrence, .after = First.Protein.Description) # Moves Occurrence to the 5th position

# Merge with replicate intensities and arrange columns
final_df <- stats_df_long %>%
  right_join(replicate_intensities, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description")) %>%
  arrange(ID)

5.3.2 Sort

Next, we dynamically generate the column order to ensure the columns are sorted first by replicate and then by metrics for each Method x Buffer combination.

# Filter to only include Repl_Techn == 1
ordered_meta_copy <- ordered_meta_data %>%
  filter(Repl_Techn == 1) %>%
  arrange(Method, Buffer_name, Repl_Biol)

# Function to generate columns for each Method x Buffer combination
generate_columns <- function(method, buffer, m_value) {
  replicate_cols <- paste0("M", m_value, "-", buffer, "-B", 1:4)
  metric_cols <- paste0("M", m_value, "-", buffer, "_", c("MeanLFQ", "SDLFQ", "CVLFQ"))
  c(replicate_cols, metric_cols)
}

method_buffer_combinations <- ordered_meta_copy %>%
  distinct(Method, Buffer_name, M_value)

# Generate the column order dynamically
stat_cols <- unlist(lapply(1:nrow(method_buffer_combinations), function(i) {
  generate_columns(method_buffer_combinations$Method[i], 
                   method_buffer_combinations$Buffer_name[i], 
                   method_buffer_combinations$M_value[i])
}))

# Keep the identifier columns first, followed only by generated columns present in final_df
all_cols <- c("ID", "Protein.Group", "Genes", "First.Protein.Description", "Occurrence",
              stat_cols[stat_cols %in% names(final_df)])

# Select columns in the specified order
final_df <- final_df %>%
  select(all_of(all_cols))

5.3.3 Save

Finally, we save the sorted dataframe to a CSV file.

# Save the final dataframe to CSV
write.csv(final_df, file.path(final_out_dir, "statistics_df.csv"), row.names = FALSE)

5.4 Attributes

library("UniprotR")
library(Biostrings)
# Define the path to the FASTA file
fasta_file <- "data/UP000002494.fasta"

# Read the FASTA file
fasta_sequences <- readAAStringSet(fasta_file)

# Extract Protein IDs and sequences
protein_ids <- names(fasta_sequences)
sequences <- as.character(fasta_sequences)

# Create a data frame mapping Protein IDs to sequences
protein_sequence_map <- data.frame(ProteinID = protein_ids, Sequence = sequences, stringsAsFactors = FALSE)

# Split the ProteinID by '|' and extract the second element
protein_sequence_map$ProteinID <- sapply(strsplit(protein_sequence_map$ProteinID, "\\|"), `[`, 2)
result_df <- final_df[c("ID", "Protein.Group", "Genes", "First.Protein.Description", "Occurrence")] %>%
  mutate(Protein.ID = Protein.Group) %>%
  separate_rows(Protein.ID, sep = ";") %>%
  distinct() %>%
  left_join(protein_sequence_map, by = c("Protein.ID" = "ProteinID")) 
library("Peptides")
# Function to calculate properties for a sequence
calculate_properties <- function(seq) {
  if (is.na(seq)) {
    return(data.frame(pI = NA, mw = NA, gravy = NA))
  }
  pI_value <- pI(seq = seq, pKscale = "Bjellqvist")
  mw_value <- mw(seq = seq, monoisotopic = FALSE)
  gravy_value <- hydrophobicity(seq = seq, scale = "KyteDoolittle")
  return(data.frame(pI = pI_value, mw = mw_value, gravy = gravy_value))
}

# Apply the function to each sequence in the data frame
properties_df <- do.call(rbind, lapply(result_df$Sequence, calculate_properties))
## Warning: Sequence 1 has unrecognized amino acid types. Output value might be
## wrong calculated
# Combine the properties with the original data frame
result_with_properties_df <- cbind(result_df, properties_df)
attributes_df <- result_with_properties_df %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Occurrence) %>%
  summarize(
    mean_pI = mean(pI, na.rm = TRUE),
    mean_mw = mean(mw, na.rm = TRUE),
    mean_gravy = mean(gravy, na.rm = TRUE)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'ID', 'Protein.Group', 'Genes',
## 'First.Protein.Description'. You can override using the `.groups` argument.
write.csv(attributes_df, file.path(final_out_dir, "attributes_df.csv"), row.names = FALSE)
final_attributes_df <- attributes_df %>%
  inner_join(final_df, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description", "Occurrence"))

write.csv(final_attributes_df, file.path(final_out_dir, "attributes_statistics_df.csv"), row.names = FALSE)

write_xlsx(final_attributes_df, file.path(final_out_dir, "final_attributes_df.xlsx"))
common_columns <- c("ID", "Protein.Group", "Genes", "First.Protein.Description")
merged_attributes_df <- merge(dataset_filtered_in_3_of_4, attributes_df, by = common_columns)

merged_attributes_df_repl <- merge(dataset_filtered_in_3_of_4_repl, attributes_df, by = "ID")

5.5 Visualizations

# Calculate unique proteins for each method for the two conditions
plot_dataset <- rbind(
   dataset_filtered_in_3_of_4_repl[!Method %in% c("1step", "1step+C"), .(Protein_Count = .N, Source = "3 of 4 biol. Replicates"), by = .(Method, Buffer_name, Buffer_type)],
  dataset_filtered_in_3_of_4_repl[, .(Protein_Count = uniqueN(ID)), by = Method][
  , `:=`(Buffer_name = "All Buffers", Buffer_type = "Total", Source = "3 of 4 biol. Replicates")])

# Create and save the barplot
p <- ggplot(plot_dataset, aes(x = Method, y = Protein_Count, fill = Buffer_type)) +
  geom_bar(aes(group = interaction(Buffer_name, Method)), stat = "identity", width = 0.8, position = position_dodge2(width = 0.9, preserve = "single"), alpha = 0.9) +
  geom_text(aes(label = Protein_Count, group = interaction(Buffer_name, Method)), position = position_dodge2(width = 0.9, preserve = "single"), vjust = -0.5, hjust = 0.5, size = 3) +
  #facet_wrap(~ Method, scales = "free_x", ncol = 2) +
  theme_minimal() + # Using a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Number of Quantified Proteins per Buffer, by Method",
       y = "No. of reproducibly\nquantified proteins", x = "Method", fill = "Buffer Type") +
  scale_fill_manual(values = fraction_colors) + 
  ylim(0, 6000)

ggsave(file.path(final_out_dir, "Proteins_per_Buffer_by_Method_with_Summary.png"), plot = p)
## Saving 7 x 5 in image
print(p)

plot_df <- stats_df %>%
  mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
  mutate(
    Buffer_type = ifelse(grepl("AI", Buffer_name), "AI",
                        ifelse(grepl("AS", Buffer_name), "AS", "single"))) %>%
  arrange(Method, Buffer_name) %>%
  mutate(Method_Buffer = factor(Method_Buffer, levels = unique(Method_Buffer)))

# Determine positions for vertical lines to separate the methods
method_positions <- plot_df %>%
  group_by(Method) %>%
  summarize(position = min(as.numeric(Method_Buffer)) - 0.5) %>%
  filter(position > 0)

# Create the violin plot with overlaid boxplot and vertical lines
p2 <- ggplot(plot_df, aes(x = Method_Buffer, y = CVLFQ, fill = Buffer_type)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +  # Add boxplot
  geom_vline(data = method_positions, aes(xintercept = position), linetype = "dotted", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of CVLFQ by Buffer Name and Method",
       x = "Method - Buffer Name",
       y = "CVLFQ", fill = "Buffer Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = fraction_colors)

ggsave(file.path(final_out_dir, "Distribution_of_CVLFQ_by_Buffer_Name_and_Method.png"), plot = p2)
## Saving 7 x 5 in image
print(p2)

combined_plot <- p / p2 + plot_layout(ncol = 1) + plot_annotation(tag_levels = 'A') +
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom")

# Save the combined figure
ggsave(file.path(final_out_dir, "Method_Comparison.png"), plot = combined_plot)
## Saving 7 x 5 in image
print(combined_plot)

5.5.1 Final

# Load necessary libraries
library(ggplot2)
library(patchwork)
library(dplyr)
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:patchwork':
## 
##     align_plots
# Create Method_Buffer column for the first plot dataset
plot_dataset <- rbind(
   dataset_filtered_in_3_of_4_repl[, .(Protein_Count = .N, Source = "3 of 4 biol. Replicates"), by = .(Method, Buffer_name, Buffer_type)],
   dataset_filtered_in_3_of_4_repl[, .(Protein_Count = uniqueN(ID)), by = Method][
    , `:=`(Buffer_name = "Total", Buffer_type = "Total", Source = "3 of 4 biol. Replicates")]
) %>%
  mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
  arrange(Method, Buffer_name) %>%
  mutate(Method_Buffer = factor(Method_Buffer, levels = unique(Method_Buffer)))

# Combine all unique Buffer_name levels
combined_levels <- unique(plot_dataset$Buffer_name)

# Create the mapping for x-axis labels using Buffer_name
buffer_name_labels <- plot_dataset %>%
  select(Method_Buffer, Buffer_name) %>%
  distinct() %>%
  arrange(Method_Buffer)

buffer_name_labels$Buffer_name <- as.character(buffer_name_labels$Buffer_name)
named_labels <- setNames(buffer_name_labels$Buffer_name, buffer_name_labels$Method_Buffer)
named_labels[named_labels == "single"] <- "Single"

plot_dataset$Buffer_type <- factor(plot_dataset$Buffer_type, levels = c("single", "AS", "AI", "Total")) 

plot_dataset_filtered <- plot_dataset %>% filter(Buffer_type != "single")
plot_dataset_filtered$Method_Buffer <- factor(plot_dataset_filtered$Method_Buffer, 
                                              levels = setdiff(levels(plot_dataset_filtered$Method_Buffer), 
                                                               c("1step_single", "1step+C_single")))

# Determine positions for vertical lines to separate the methods
method_positions <- plot_dataset_filtered %>%
  group_by(Method) %>%
  summarize(position = min(as.numeric(Method_Buffer)) - 0.5) %>%
  filter(position > 0)

# Calculate midpoints for each Method to place the Method label centered above the group
method_labels <- plot_dataset_filtered %>%
  group_by(Method) %>%
  summarize(
    midpoint = mean(as.numeric(Method_Buffer))) %>%
  ungroup()

# Create the first plot (barplot)
p1 <- ggplot(plot_dataset_filtered, aes(x = Method_Buffer, y = Protein_Count, fill = Buffer_type)) +
  geom_bar(aes(group = interaction(Buffer_name, Method)), stat = "identity", width = 0.8, position = position_dodge2(width = 0.9, preserve = "single"), color = "black") +
  geom_text(aes(label = Protein_Count, group = interaction(Buffer_name, Method)), position = position_dodge2(width = 0.9, preserve = "single"), vjust = -0.5, hjust = 0.5, size = 4) +
  geom_text(data = method_labels, aes(x = midpoint, y = 6000, label = Method), 
            size = 4, vjust = -0.5, color = "black", inherit.aes = FALSE) +
  theme_minimal() + 
  theme(
    panel.grid = element_blank(),       # Remove grid
    panel.background = element_blank(),  # Remove background
    axis.title.y = element_text(size = 16),         # Change y-axis label size
    axis.text.y = element_text(size = 12),          # Change y-axis tick label size
    axis.text.x = element_text(size = 12, angle = 0), # Change x-axis tick label size
    strip.text.x = element_text(size = 14), 
    panel.spacing = unit(0.1, "lines"), 
    panel.border = element_rect(color = "white", fill = NA, linewidth = 1), 
    legend.position = "none"
  ) +
  labs(
    title = "",
    y = "No. of reproducibly\nquantified proteins", 
    x = "", 
    fill = "Buffer Type"
  ) +
  scale_fill_manual(
    values = setNames(adjustcolor(fraction_colors, alpha.f = 0.7), names(fraction_colors)),
    labels = c("single" = "Single", "AS" = "AS", "AI" = "AI", "Total" = "Total")
  ) +
  scale_x_discrete(labels = named_labels) + 
  ylim(0, 6500) +
  geom_vline(data = method_positions, aes(xintercept = position), linetype = "dotted", color = "black", linewidth = 0.8) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = 0.5) 
# Create the second plot (violin plot with boxplot)
plot_df <- stats_df %>%
  mutate(Method_Buffer = paste(Method, Buffer_name, sep = "_")) %>%
  mutate(
    Buffer_type = factor(ifelse(grepl("AI", Buffer_name), "AI",
                                     ifelse(grepl("AS", Buffer_name), "AS", "single")), 
                             levels = c("single", "AS", "AI", "Total"))) %>%
  arrange(Method, Buffer_name) %>%
  mutate(Method_Buffer = factor(Method_Buffer, levels = unique(Method_Buffer)))

# Create the mapping for x-axis labels using Buffer_name
buffer_name_labels <- plot_df %>%
  select(Method_Buffer, Buffer_name) %>%
  distinct() %>%
  arrange(Method_Buffer)

buffer_name_labels$Buffer_name <- as.character(buffer_name_labels$Buffer_name)
named_labels <- setNames(buffer_name_labels$Buffer_name, buffer_name_labels$Method_Buffer)
named_labels[named_labels == "single"] <- "Single"

# Determine positions for vertical lines to separate the methods
method_positions <- plot_df %>%
  group_by(Method) %>%
  summarize(position = min(as.numeric(Method_Buffer)) - 0.5) %>%
  filter(position > 0)

# Calculate midpoints for each Method to place the Method label centered above the group
method_labels <- plot_df %>%
  group_by(Method) %>%
  summarize(
    midpoint = mean(as.numeric(Method_Buffer))) %>%
  ungroup()

p2 <- ggplot(plot_df, aes(x = Method_Buffer, y = CVLFQ, fill = Buffer_type)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +  
  theme_minimal() +
  geom_text(data = method_labels, aes(x = midpoint, y = 220, label = Method), 
            size = 4, vjust = -0.5, color = "black", inherit.aes = FALSE) +
  labs(title = "",
       x = "",
       y = "Coefficient of variation\nin % (maxLfQ)", fill = "Buffer Type") +
  theme(panel.grid = element_blank(), panel.background = element_blank(), 
        axis.title.y = element_text(size = 16),         # Change y-axis label size
        axis.text.y = element_text(size = 12),          # Change y-axis tick label size
        axis.text.x = element_text(size = 12, angle = 0), # Change x-axis tick label size
        strip.text.x = element_text(size = 14), 
        panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "white", fill = NA, linewidth = 1), legend.position = "none") +
  scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.8), names(fraction_colors))) + 
  scale_x_discrete(labels = named_labels) + 
  ylim(0, 240) +
  geom_vline(data = method_positions, aes(xintercept = position), linetype = "dotted", color = "black") +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = 0.5)

# Create a data frame just for the legend
legend_data <- data.frame(
  Buffer_type = factor(c("single", "AS", "AI", "Total"), levels = c("single", "AS", "AI", "Total")),
  x = 1:4,  # Arbitrary values for plotting
  y = 1     # Arbitrary values for plotting
)

# Generate the custom legend plot
legend_plot <- ggplot(legend_data, aes(x = x, y = y, fill = Buffer_type)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.8), names(fraction_colors)),
                    labels = c("Single", "AS", "AI", "Total")) +
  theme_void() +
  theme(legend.position = "bottom") +
  labs(fill = "Buffer Type    ")

legend_grob <- legend_plot + theme(legend.position = "bottom") + guides(fill = guide_legend(nrow = 1))

combined_plot <-((p1 / p2) + 
  plot_layout(heights = c(1, 1)) + 
  plot_annotation(tag_levels = 'A')) / 
  plot_grid(legend_grob, ncol = 1) +
  plot_layout(heights = c(1, 1, 0.1)) 

# Save the combined plot
ggsave(file.path(final_out_dir, "Combined_Proteins_and_CVLFQ_Distribution.png"), plot = combined_plot,
       width = 1148, height = 729, units = "px",  dpi = 100)
## Warning: Removed 289 rows containing missing values or values outside the scale range
## (`geom_violin()`).
# Print the combined plot
print(combined_plot)
## Warning: Removed 289 rows containing missing values or values outside the scale range
## (`geom_violin()`).

# Calculate the median CVLFQ for each Method_Buffer group
median_values <- plot_df %>%
  group_by(Method_Buffer) %>%
  summarize(median_CVLFQ = median(CVLFQ, na.rm = TRUE))

# Display the median values
print(median_values)
## # A tibble: 10 × 2
##    Method_Buffer  median_CVLFQ
##    <fct>                 <dbl>
##  1 1step_single           52.6
##  2 1step+C_single         29.6
##  3 2step_AS               37.0
##  4 2step_AI               33.0
##  5 2step+M_AS             31.4
##  6 2step+M_AI             18.8
##  7 4step_AS1              39.9
##  8 4step_AI1              35.8
##  9 4step_AI2              23.8
## 10 4step_AS2              39.2
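
If `final_out_dir` is set, this summary table can also be written out alongside the figures. A minimal sketch, assuming `median_values` and `final_out_dir` from the chunks above; the file name is illustrative:

```r
# Sketch: persist the per-group median CV table (file name is illustrative)
if (!is.null(final_out_dir)) {
  data.table::fwrite(median_values, file.path(final_out_dir, "Median_CVLFQ_per_Method_Buffer.csv"))
}
```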

6 2Step vs 2Step+

In this analysis, we compare the performance of two methods, 2step and 2step+ (labelled 2step+M after the metadata cleanup), across two buffer types: AS and AI. The objective is to evaluate differences in the identified IDs and to assess the intersections and unique contributions of each buffer-method combination. The data have been processed to extract the unique IDs for every combination, and several visualizations have been prepared to provide insight.

6.1 IDs per Buffer x Method

We first extract the unique IDs for each buffer and method combination from the filtered dataset. The goal is to analyze the sets of IDs and their overlaps using various visualization techniques.

# Extract unique IDs for each buffer and method combination
buffer_groups <- dataset_filtered_in_3_of_4_repl[, .N, by = .(ID, Buffer_name, Method)]
sets <- list(
  `2step+M AI` = buffer_groups[Buffer_name == "AI" & Method == "2step+M", ID],
  `2step+M AS` = buffer_groups[Buffer_name == "AS" & Method == "2step+M", ID],
  `2step AI` = buffer_groups[Buffer_name == "AI" & Method == "2step", ID],
  `2step AS` = buffer_groups[Buffer_name == "AS" & Method == "2step", ID]
)
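
The set sizes and pairwise overlaps can be sanity-checked numerically before plotting. A minimal sketch using toy IDs in place of the real `sets` entries:

```r
# Sketch: overlap between two ID sets (toy IDs; the analysis uses `sets` above)
toy_as <- 1:100
toy_ai <- 51:150
shared  <- length(intersect(toy_as, toy_ai))       # 50 IDs found in both sets
jaccard <- shared / length(union(toy_as, toy_ai))  # 50 / 150
```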

6.2 Visualization

6.2.1 Upset

The Upset plot is created to visualize the intersection of IDs among the different buffer-method combinations. It helps in identifying unique and shared IDs across the four sets:

  • 2step+M AI
  • 2step+M AS
  • 2step AI
  • 2step AS

The plot highlights the overlaps and exclusive IDs, providing a clear picture of the distribution of IDs across the different methods and buffers.

library("ComplexUpset")
library("ggplot2")

# Create a data frame to represent the presence/absence of each ID in the sets
all_ids <- unique(unlist(sets))
upset_data <- as.data.frame(sapply(sets, function(x) as.integer(all_ids %in% x)))
upset_data$ID <- all_ids

# Convert to long format for ComplexUpset
upset_data_long <- reshape2::melt(upset_data, id.vars = "ID", variable.name = "Set", value.name = "Presence")
upset_data_long <- upset_data_long[upset_data_long$Presence == 1, ]

# Define the colors for the sets
set_colors <- c("royalblue", "palegreen3", "royalblue", "palegreen3")
names(set_colors) <- names(sets)

# Create queries based on set colors
queries <- lapply(names(set_colors), function(set) {
  upset_query(set = set, fill = set_colors[set])
})

# Create the UpSet plot using ComplexUpset
upset_plot <- ComplexUpset::upset(
  upset_data, 
  names(sets),
  name = "",
  sort_sets = FALSE,
  queries = queries,
  set_sizes = (upset_set_size() + geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count') + expand_limits(y = 5000))
) + ggtitle("Comparison of AS and AI Buffers Across 2step and 2step+ Methods") 

# Save the plot
ggsave(
  filename = file.path(final_out_dir, "Comparison_of_AS_and_AI_Buffers_Across_2step_and_2step_plus_Methods.png"),
  plot = upset_plot,
  width = 10, 
  height = 6, 
  bg = 'white'
)
# Print the plot to the console
print(upset_plot)

6.2.2 Venn Diagrams

To further explore the intersections, Venn diagrams are generated. We create Venn diagrams for combined buffer-method sets as well as separately for 2step and 2step+ methods:

  1. Combined Venn Diagram: This diagram shows the intersections of all four sets.
  2. 2step Method Venn Diagram: This diagram illustrates the intersection between AS and AI buffers within the 2step method.
  3. 2step+ Method Venn Diagram: This diagram illustrates the intersection between AS and AI buffers within the 2step+ method.

library(ggvenn)
## Loading required package: grid
## 
## Attaching package: 'grid'
## The following object is masked from 'package:Biostrings':
## 
##     pattern
# Function to create and save Venn diagram
create_and_save_venn <- function(sets_list, title, sub_dir, filename, colors) {
  mapped_colors <- sapply(names(sets_list), function(x) {
    set_name <- strsplit(x, " ")[[1]][2]  # Split by space and take the second part
    colors[set_name]  # Use the second part to get the color
  }, USE.NAMES = TRUE)
  names(mapped_colors) <- sapply(names(mapped_colors), function(x) {strsplit(x, "\\.")[[1]][1]})
  
  venn_plot <- ggvenn(
    data = sets_list,
    fill_color = unname(mapped_colors),
    fill_alpha = 1,
    stroke_size = 0.5,
    set_name_size = 5,
    text_size = 4
  ) + ggtitle(title) + coord_fixed(expand = FALSE, xlim = c(-2.2, 2.2), ylim = c(-2, 1.7))

  # Save the Venn diagram
  ggsave(file.path(sub_dir, filename), plot = venn_plot, width = 7.5, height = 6, bg = 'white')
  return(venn_plot)
}
venn_combined <- create_and_save_venn(
  sets_list = list(`2step AS` = sets$`2step AS`, `2step+ AS` = sets$`2step+M AS`, `2step AI` = sets$`2step AI`, `2step+ AI` = sets$`2step+M AI`),
  title = "Intersection of AS and AI Buffers Across 2step and 2step+ Methods",
  sub_dir = final_out_dir,
  filename = "VennDiagram_Combined.png",
  colors = fraction_colors
)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Create and save Venn diagram for "2step"
venn1 <- create_and_save_venn(
  sets_list = list(`2step AS` = sets$`2step AS`, `2step AI` = sets$`2step AI`),
  title = "Intersection of AS and AI in 2step Method",
  sub_dir = final_out_dir,
  filename = "VennDiagram_2step.png",
  colors = fraction_colors
)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Create and save Venn diagram for "2step+"
venn2 <- create_and_save_venn(
  sets_list = list(`2step+ AS` = sets$`2step+M AS`, `2step+ AI` = sets$`2step+M AI`),
  title = "Intersection of AS and AI in 2step+ Method",
  sub_dir = final_out_dir,
  filename = "VennDiagram_2step_plus.png",
  colors = fraction_colors
)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
library(VennDiagram)
## Loading required package: futile.logger
library(ggplotify)
library(grid)

# Function to create and save Venn diagram
create_and_save_venn <- function(ai_ids, as_ids, method_label, sub_dir, filename) {
  # Suppress warnings temporarily
  oldw <- getOption("warn")
  options(warn = -1)
  
  venn.plot <- venn.diagram(
    x = list(
      AS = unique(na.omit(as_ids)),
      AI = unique(na.omit(ai_ids))
    ),
    category.names = c("AS", "AI"),
    filename = NULL,
    fill = c("palegreen3", "royalblue"),
    alpha = 0.5,
    cex = 2,
    cat.cex = 2,
    cat.pos = c(20, -30),
    cat.dist = 0.05,
    cat.col = c("palegreen3", "royalblue"),
    margin = 0.1,
    inverted = TRUE
  )

  # Save the Venn diagram
  png(filename = file.path(sub_dir, filename), width = 800, height = 600)
  grid.newpage()
  grid.draw(venn.plot)
  grid.text(paste("Intersection of AS and AI in", method_label, "Method"), x = 0.5, y = 0.95, gp = gpar(fontsize = 16, fontface = "bold"))
  dev.off()
  
  # Restore warning settings
  options(warn = oldw)
  return(venn.plot)
}
# Create and save Venn diagram for "2step"
venn1 <- create_and_save_venn(sets$`2step AI`, sets$`2step AS`, "2step", final_out_dir, "VennDiagram_2step.png")
venn1.ggplot <- as.ggplot(~grid.draw(venn1)) #+ ggtitle("Intersection of AS and AI in 2step Method")

# Create and save Venn diagram for "2step+"
venn2 <- create_and_save_venn(sets$`2step+M AI`, sets$`2step+M AS`, "2step+M", final_out_dir, "VennDiagram_2step_plus.png")
venn2.ggplot <- as.ggplot(~grid.draw(venn2))# + ggtitle("Intersection of AS and AI in 2step+ Method")

6.2.3 Violin

A split violin plot is created to visualize the distribution of log-transformed Mean LFQ values by buffer type, separated by method (2step vs 2step+). This plot helps in comparing the intensity distributions between the two methods for each buffer type.

GeomSplitViolin <- ggproto("GeomSplitViolin", GeomViolin, 
                           draw_group = function(self, data, ..., draw_quantiles = NULL) {
  data <- transform(data, xminv = x - violinwidth * (x - xmin), xmaxv = x + violinwidth * (xmax - x))
  grp <- data[1, "group"]
  newdata <- plyr::arrange(transform(data, x = if (grp %% 2 == 1) xminv else xmaxv), if (grp %% 2 == 1) y else -y)
  newdata <- rbind(newdata[1, ], newdata, newdata[nrow(newdata), ], newdata[1, ])
  newdata[c(1, nrow(newdata) - 1, nrow(newdata)), "x"] <- round(newdata[1, "x"])

  if (length(draw_quantiles) > 0 && !scales::zero_range(range(data$y))) {
    stopifnot(all(draw_quantiles >= 0), all(draw_quantiles <= 1))
    quantiles <- ggplot2:::create_quantile_segment_frame(data, draw_quantiles)
    aesthetics <- data[rep(1, nrow(quantiles)), setdiff(names(data), c("x", "y")), drop = FALSE]
    aesthetics$alpha <- rep(1, nrow(quantiles))
    both <- cbind(quantiles, aesthetics)
    quantile_grob <- GeomPath$draw_panel(both, ...)
    ggplot2:::ggname("geom_split_violin", grid::grobTree(GeomPolygon$draw_panel(newdata, ...), quantile_grob))
  }
  else {
    ggplot2:::ggname("geom_split_violin", GeomPolygon$draw_panel(newdata, ...))
  }
})

geom_split_violin <- function(mapping = NULL, data = NULL, stat = "ydensity", position = "identity", ..., 
                              draw_quantiles = NULL, trim = TRUE, scale = "area", na.rm = FALSE, 
                              show.legend = NA, inherit.aes = TRUE) {
  layer(data = data, mapping = mapping, stat = stat, geom = GeomSplitViolin, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = list(trim = trim, scale = scale, draw_quantiles = draw_quantiles, na.rm = na.rm, ...))
}
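
Before applying the custom geom to the real data, it can be sanity-checked on simulated values. A minimal toy example, assuming the `geom_split_violin()` definition above is loaded; note that each x position must carry exactly two fill levels, which is why the analysis maps `Buffer_name` (AS/AI) to fill within each method:

```r
# Sketch: toy check of geom_split_violin on simulated data
set.seed(1)
toy <- data.frame(
  grp   = rep(c("A", "B"), each = 200),
  side  = rep(c("left", "right"), times = 200),
  value = rnorm(400)
)
ggplot(toy, aes(x = grp, y = value, fill = side)) +
  geom_split_violin() +
  theme_minimal()
```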
plot_df <- stats_df[stats_df$Method %in% c("2step", "2step+M"),]

# Create the split violin plot with log-transformed MeanLFQ
violin <- ggplot(plot_df, aes(x = Method, y = log10(MeanLFQ), fill = Buffer_name)) +
  geom_split_violin() +
  labs(title = "", x = "Method", y = "loh10 maxLfQ intensities") +
  theme_minimal() +
  scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.7), names(fraction_colors)))
 # scale_fill_manual(values = c("royalblue", "palegreen3"))


ggsave(file.path(final_out_dir, "Violin_BufferTypes.png"), plot = p, width = 10, height = 6, bg = 'white')

Finally, the individual plots are combined into a single figure for a comprehensive visual summary of the analysis. The combined plot includes the Venn diagrams, Upset plot, and violin plot.

# Combining the plots
combined_plot <- (venn1.ggplot | venn2.ggplot | violin) / (upset_plot) + plot_annotation(tag_levels = 'A')


# Save the combined figure
ggsave(file.path(final_out_dir, "2step_vs_2stepplus.png"), plot = combined_plot, width = 12, height = 6, dpi = 100)

print(combined_plot)

6.2.3.1 Final

library(VennDiagram)
library(ggplotify)
library(grid)

# Function to create and save Venn diagram
create_and_save_venn <- function(ai_ids, as_ids, method_label) {
  # Suppress warnings temporarily
  oldw <- getOption("warn")
  options(warn = -1)
  
  venn.plot <- venn.diagram(
    x = list(
      AS = unique(as_ids),
      AI = unique(ai_ids)
    ),
    category.names = c("AS", "AI"),
    filename = NULL,
    fill = c("palegreen3", "royalblue"),
    alpha = 0.7,
    cex = 1.2,
    fontfamily = "sans",  # Set font to match other plots
    cat.cex = 1.2,
    cat.pos = c(20, -30),
    cat.dist = 0.05,
    cat.col = c("palegreen3", "royalblue"),
    cat.fontfamily = "sans",
    margin = 0.01,
    inverted = TRUE
  )
  
  # Restore warning settings
  options(warn = oldw)
  return(venn.plot)
}
# Create and save Venn diagram for "2step"
venn1 <- create_and_save_venn(sets$`2step AI`, sets$`2step AS`, "2step")
venn1.ggplot <- as.ggplot(~grid.draw(venn1)) #+ ggtitle("Intersection of AS and AI in 2step Method")

# Create and save Venn diagram for "2step+"
venn2 <- create_and_save_venn(sets$`2step+M AI`, sets$`2step+M AS`, "2step+")
venn2.ggplot <- as.ggplot(~grid.draw(venn2))# + ggtitle("Intersection of AS and AI in 2step+ Method")


plot_df <- stats_df[stats_df$Method %in% c("2step", "2step+M"),]

breaks <- 10^(-10:10)
minor_breaks <- rep(1:9, 21)*(10^rep(-10:10, each=9))

# Create the split violin plot with log-transformed MeanLFQ
violin <- ggplot(plot_df, aes(x = Method, y = MeanLFQ, fill = Buffer_name)) +
  geom_split_violin() +
  labs(title = "", x = "", y = "log10 maxLfQ intensities", fill = "Buffer Type") +
  theme_minimal() +
  scale_fill_manual(values = setNames(adjustcolor(fraction_colors, alpha.f = 0.7), names(fraction_colors))) +
  theme(
    text = element_text(size = 14),
    axis.text.x = element_text(size = 14),  # Larger x-axis labels
    axis.text.y = element_text(size = 10),  # y-axis tick label size
    panel.grid.minor.y = element_line(color="grey", linetype="dotted"), 
    panel.grid.major.y = element_line(color="grey", linetype="solid")
  ) + scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
  annotation_logticks(sides="l")


library("ComplexUpset")
library("ggplot2")

# Create a data frame to represent the presence/absence of each ID in the sets
all_ids <- unique(unlist(sets))
upset_data <- as.data.frame(sapply(sets, function(x) as.integer(all_ids %in% x)))
upset_data$ID <- all_ids

# Convert to long format for ComplexUpset
upset_data_long <- reshape2::melt(upset_data, id.vars = "ID", variable.name = "Set", value.name = "Presence")
upset_data_long <- upset_data_long[upset_data_long$Presence == 1, ]

# Define the colors for the sets
set_colors <- c("royalblue", "palegreen3", "royalblue", "palegreen3")
names(set_colors) <- names(sets)

# Create queries based on set colors
queries <- lapply(names(set_colors), function(set) {
  upset_query(set = set, fill = set_colors[set])
})


# Create the UpSet plot using ComplexUpset
upset_plot <- ComplexUpset::upset(
  upset_data, 
  names(sets),
  name = "",
  sort_sets = FALSE,
  base_annotations=list(
        'No. of\nintersecting\nproteins'=intersection_size(counts=TRUE) +
          theme(axis.text.y = element_text(size = 12),  # Increase y-axis annotation size
          axis.title.y = element_text(size = 13))  # Increase y-axis title size
    ),
  queries = queries,
  set_sizes = (upset_set_size() + geom_text(aes(label = after_stat(count)), hjust = 1.1, stat = 'count', size = 4) + expand_limits(y = 5000) +
               theme(
      axis.title.x = element_text(size = 13),   # Set size X-axis title size
      axis.text.x = element_text(size = 12)    # X-axis tick labels size
    ))
) + theme(text = element_text(size = 14),
          axis.title.x = element_text(size = 12)) 

# Save the plot
ggsave(
  filename = file.path(final_out_dir, "Comparison_of_AS_and_AI_Buffers_Across_2step_and_2step_plus_Methods.png"),
  plot = upset_plot,
  width = 10, 
  height = 6, 
  bg = 'white'
)

# Print the plot to the console
print(upset_plot)

# Combining the plots
combined_plotx <- (venn1.ggplot | venn2.ggplot | violin) / (as.ggplot(upset_plot)) + plot_annotation(tag_levels = 'A') +
  plot_layout(guides = "collect") & theme(legend.position = "bottom")


# Save the combined figure
ggsave(file.path(final_out_dir, "2step_vs_2stepplus.png"), plot = combined_plotx, width = 12, height = 8, dpi = 200)

print(combined_plotx)

6.2.4 Attribute Violin

6.2.4.1 Test

combined_df <- stats_df %>% arrange(Protein.Group, Genes, Method, Buffer_name)
filtered_df <- combined_df %>%
  filter(Method %in% c("2step", "2step+M")) %>%
  mutate(Protein.ID = Protein.Group) %>%
  separate_rows(Protein.ID, sep = ";") %>%
  left_join(result_with_properties_df, by=c("ID","Protein.Group","Genes","Protein.ID"))

# Identify IDs uniquely found in either "2step" or "2step+" methods or shared
unique_ids <- filtered_df %>%
  group_by(ID) %>%
  summarize(Methods = n_distinct(Method)) %>%
  ungroup()

# Join the unique_ids back to the filtered_df to retain the Method column
unique_ids <- filtered_df %>%
  left_join(unique_ids, by = "ID")

# Classify IDs
filtered_df <- unique_ids %>%
  mutate(Category = ifelse(Methods == 1,
                           ifelse(Method == "2step", "Only 2step",
                                  ifelse(Method == "2step+M", "Only 2step+M", NA)),
                           ifelse(Methods > 1, "Shared", NA)))


# Categorize by buffer type
buffer_categories <- filtered_df %>%
  group_by(ID, Method) %>%
  summarize(Buffers = n_distinct(Buffer_name), .groups = "drop")
buffer_categories <- filtered_df %>%
  left_join(buffer_categories, by = c("ID", "Method"))

filtered_df <- buffer_categories %>%
  mutate(Buffer_Category = ifelse(Buffers == 1,
                                  ifelse(Buffer_name == "AI", "Only AI",
                                         ifelse(Buffer_name == "AS", "Only AS", NA)),
                                  ifelse(Buffers > 1, "Both", NA)))


# Summarize data for plotting
plot_data <- filtered_df %>%
  group_by(Category, Buffer_Category) %>%
  summarize(Count = n_distinct(Protein.ID), .groups = "drop")
# Filter the original filtered_df based on unique IDs
filtered_df2 <- filtered_df %>%
  filter(Category != "Shared")

# Call reshape2::melt explicitly (the data.table redirect for tbl_df inputs is deprecated)
melted_df <- reshape2::melt(as.data.frame(filtered_df2),
                  id.vars = c("ID", "Protein.Group", "Genes", "Method", "Buffer_name"),
                  measure.vars = c("pI", "mw", "gravy"),
                  variable.name = "Property",
                  value.name = "Value")
# Function to plot each property
plot_property <- function(df, property, log_scale = FALSE) {
  p <- ggplot(df %>% filter(Property == property), aes(x = Buffer_name, y = Value, fill = Method)) +
    geom_split_violin() +
    theme_minimal() +
    scale_fill_manual(values = c("#E69F00", "#56B4E9"))
  
  if (log_scale) {
    p <- p +
      labs(title = paste("Log10", property, "by Buffer Type"), x = "Buffer Type", y = property) +
      theme(
        text = element_text(size = 14),
        axis.text.x = element_text(size = 16),  # Larger x-axis labels
        axis.text.y = element_text(size = 10),  # y-axis tick label size
        panel.grid.minor.y = element_line(color = "grey", linetype = "dotted"),
        panel.grid.major.y = element_line(color = "grey", linetype = "solid")
      ) +
      # Apply the log10 scale once, with explicit breaks, to avoid the
      # "Scale for y is already present" warning
      scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
      annotation_logticks(sides = "l")
  } else {
    p <- p + labs(title = paste(property, "by Buffer Type"), x = "Buffer Type", y = property) 
  }
  
  return(p)
}

# Plot for pI
plot_pI <- plot_property(melted_df, "pI")
print(plot_pI)
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Plot for mw with logarithmic scale
plot_mw <- plot_property(melted_df, "mw", log_scale = TRUE)
print(plot_mw)
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Plot for gravy
plot_gravy <- plot_property(melted_df, "gravy")
print(plot_gravy)
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

6.2.4.2 Mean per Protein.Group

library(scales)
plot_df <- merged_attributes_df_repl[merged_attributes_df_repl$Method %in% c("2step", "2step+M")]

# Function to plot each property
plot_property <- function(df, property, ylab, log_scale = FALSE) {
  p <- ggplot(df, aes(x = Buffer_name, y = get(property), fill = Method)) +
    geom_split_violin() +
    theme_minimal() +
    scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
    theme(axis.title.x = element_blank(), legend.position = "none") +
    theme(text = element_text(size = 14)) 
  
  if (log_scale) {
    p <- p +
      labs(title = "", x = "Buffer Type", y = ylab) +
      theme(
        text = element_text(size = 14),
        axis.text.x = element_text(size = 16),  # Larger x-axis labels
        axis.text.y = element_text(size = 10),  # y-axis tick label size
        panel.grid.minor.y = element_line(color = "grey", linetype = "dotted"),
        panel.grid.major.y = element_line(color = "grey", linetype = "solid")
      ) +
      # Single scale_y_log10 call avoids the "Scale for y is already present" warning
      scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
      annotation_logticks(sides = "l")
  } else {
    p <- p + labs(title = "", x = "Buffer Type", y = ylab) 
  }
  
  return(p)
}

plot_df$mean_mw <- plot_df$mean_mw/1000
# Plot for pI
plot_pI <- plot_property(plot_df, "mean_pI", ylab = "Isoelectric point (pI)")
print(plot_pI)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Plot for mw with logarithmic scale
plot_mw <- plot_property(plot_df, "mean_mw", ylab = "Molecular weight [kDa]", log_scale = TRUE)
print(plot_mw)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Plot for gravy
plot_gravy <- plot_property(plot_df, "mean_gravy", ylab = "GRAVY score")
print(plot_gravy)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Combine the plots using patchwork and add a global title
combined_plot <- (plot_pI | plot_mw | plot_gravy) #+ 
  #plot_layout(guides = "collect") & 
  #theme(legend.position = "bottom")

# Add overall title
#combined_plot <- combined_plot + 
#  plot_annotation(title = "Protein Properties by Buffer Type and Method across all Protein Groups", theme = theme(plot.title = element_text(size = 16, face = "bold")))

# Display the combined plot
print(combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Filter data for IDs that are either only in AI or AS Buffer_type, but not shared inside the method
filter_ids <- function(dt, method) {
  method_dt <- dt[Method == method]
  only_ai <- method_dt[Buffer_type == "AI" & !(ID %in% method_dt[Buffer_type == "AS", ID])]
  only_as <- method_dt[Buffer_type == "AS" & !(ID %in% method_dt[Buffer_type == "AI", ID])]
  rbind(only_ai, only_as)
}
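
The exclusivity logic of `filter_ids()` can be illustrated on a small hypothetical data.table (toy IDs, not the real `plot_df`):

```r
# Sketch: filter_ids() keeps IDs seen in exactly one buffer type within a method
library(data.table)
toy_dt <- data.table(
  Method      = "2step",
  Buffer_type = c("AI", "AI", "AS", "AS"),
  ID          = c(1, 2, 2, 3)
)
filter_ids(toy_dt, "2step")  # rows for ID 1 (AI only) and ID 3 (AS only); ID 2 is shared and dropped
```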

# Function to plot each property
plot_property <- function(df, property, ylab, log_scale = FALSE) {
  p <- ggplot(df, aes(x = Buffer_name, y = get(property), fill = Method)) +
    geom_split_violin() +
    theme_minimal() +
    scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
    theme(axis.title.x = element_blank(), legend.position = "none") +
    theme(text = element_text(size = 14)) 
  
  if (log_scale) {
    p <- p +
      labs(title = "", x = "Buffer Type", y = ylab) +
      theme(
        text = element_text(size = 14),
        axis.text.x = element_text(size = 14),  # Larger x-axis labels
        axis.text.y = element_text(size = 10),  # y-axis tick label size
        panel.grid.minor.y = element_line(color = "grey", linetype = "dotted"),
        panel.grid.major.y = element_line(color = "grey", linetype = "solid")
      ) +
      # Single scale_y_log10 call avoids the "Scale for y is already present" warning
      scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
      annotation_logticks(sides = "l")
  } else {
    p <- p + labs(title = "", x = "Buffer Type", y = ylab) +
     theme(
    text = element_text(size = 14),
    axis.text.x = element_text(size = 14),  # Larger x-axis labels
    axis.text.y = element_text(size = 10)
  )
  }
  
  return(p)
}

filtered_plot_df <- bind_rows(filter_ids(plot_df, "2step"), filter_ids(plot_df, "2step+M"))

# Plot for pI
plot_pI_filtered <- plot_property(filtered_plot_df, "mean_pI", ylab = "Isoelectric point (pI)")

# Plot for mw with logarithmic scale
plot_mw_filtered <- plot_property(filtered_plot_df, "mean_mw", ylab = "Molecular weight [kDa]", log_scale = TRUE) 

# Plot for gravy
plot_gravy_filtered <- plot_property(filtered_plot_df, "mean_gravy", ylab = "GRAVY score")

# Combine the plots using patchwork and add a global title
combined_plot_filtered <- (plot_pI_filtered | plot_mw_filtered | plot_gravy_filtered) + 
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom")

# Add overall title
#combined_plot_filtered <- combined_plot_filtered + 
#  plot_annotation(title = "Protein Properties by Buffer Type and Method across unique Protein Groups per Buffer Type", theme = theme(plot.title = #element_text(size = 16, face = "bold")))

# Display the combined plot
print(combined_plot_filtered)
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

final_combined_plot <- (combined_plot / combined_plot_filtered) + plot_annotation(tag_levels = 'A') +
  plot_layout(guides = "collect") & theme(legend.position = "bottom")
#  plot_annotation(title = "Comparison of Protein Properties in Full and Filtered Data", theme = theme(plot.title = element_text(size = 18, face = "bold")),
#                  tag_levels = 'A') 

ggsave(file.path(final_out_dir, "Combined_Attributes.png"), plot = final_combined_plot, width = 8, height = 6)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Display the final combined plot
print(final_combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

## Final

last_combined_plot <- ((venn1.ggplot | venn2.ggplot | violin) / (as.ggplot(upset_plot)) / (plot_pI_filtered | plot_mw_filtered | plot_gravy_filtered)) +
  plot_annotation(tag_levels = 'A') +
  plot_layout(guides = "collect") & 
   theme(
    legend.position = "bottom",
    plot.tag = element_text(size = 14)  # Adjust size as needed for tag consistency
  )

ggsave(file.path(final_out_dir, "Combined_2step2step+.png"), plot = last_combined_plot, width =14, height = 12, dpi = 300)
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Display the final combined plot
print(last_combined_plot)
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

# Calculate the number of unique IDs per method and per buffer type for filtered_plot_df
id_counts_filtered <- filtered_plot_df[, .N, by = .(Method, Buffer_type)]
setnames(id_counts_filtered, "N", "ID_Count")

# Calculate the number of unique IDs per method and per buffer type for plot_df
id_counts_full <- plot_df[, .N, by = .(Method, Buffer_type)]
setnames(id_counts_full, "N", "ID_Count")

# Create bar plot for filtered_plot_df
plot_filtered <- ggplot(id_counts_filtered, aes(x = Method, y = ID_Count, fill = Buffer_type)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
  geom_text(aes(label = ID_Count), hjust = 1.1, position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = fraction_colors) +
  labs(title = "",
       x = "Method",
       y = "Count of Unique IDs") +
  theme_minimal() +
  coord_flip()

# Create bar plot for plot_df
plot_full <- ggplot(id_counts_full, aes(x = Method, y = ID_Count, fill = Buffer_type)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
  geom_text(aes(label = ID_Count), hjust = 1.1, position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = fraction_colors) +
  labs(title = "",
       x = "Method",
       y = "Count of Unique IDs") +
  theme_minimal() +
  coord_flip()

# Display the plots
print(plot_filtered)

print(plot_full)

final_combined_plot <- ((combined_plot | plot_full) / 
                        (combined_plot_filtered | plot_filtered)) + 
  plot_layout(guides = "collect") +
  # plot_annotation() must be added with `+`, not `&`, so the title and tags
  # apply to the assembled patchwork rather than to each subplot
  plot_annotation(title = "Comparison of Protein Properties in Full and Filtered Data", 
                  theme = theme(plot.title = element_text(size = 18, face = "bold")), 
                  tag_levels = 'A')


# Display the final combined plot
print(final_combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 40 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

7 2step+ Alone

7.1 IDs per Technical Replicate and Buffer

dataset_2step_plus <- dataset_merged[Method == "2step+M" & !is.na(maxLfQ_intensities)]
dataset_2step_plus_repl <- dataset_2step_plus[, .(Unique_Repl_Biol_Count = uniqueN(Repl_Biol)), by = .(Method, Buffer_name, Buffer_type, Repl_Techn, ID)]
dataset_2step_plus_in_3_of_4 <- merge(
  dataset_2step_plus, 
  dataset_2step_plus_repl[Unique_Repl_Biol_Count >= 3][, .(Method, Buffer_name, Repl_Techn, ID)], 
  by = c("Method", "Buffer_name", "Repl_Techn", "ID")
)
dataset_2step_plus_in_3_of_4_all_techn_per_buffer <- dataset_2step_plus_in_3_of_4[, if(uniqueN(Repl_Techn) == 4) .SD, by = .(ID, Buffer_name, Method)]

# Note: grouping by ID only (not by buffer) counts a protein as covering all 4
# technical replicates even if the sets differ per buffer,
# e.g. AS = (1,2,3) and AI = (2,3,4)
dataset_2step_plus_in_3_of_4_all_techn <- dataset_2step_plus_in_3_of_4[, if(uniqueN(Repl_Techn) == 4) .SD, by = .(ID, Method)]
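The caveat in the comment above can be checked on a minimal data.table (toy values, not from the real dataset):

```r
library(data.table)

# Toy data: one protein seen in technical replicates 1-3 for buffer AS
# and 2-4 for buffer AI
toy <- data.table(
  ID         = 1L,
  Buffer     = c("AS", "AS", "AS", "AI", "AI", "AI"),
  Repl_Techn = c(1L, 2L, 3L, 2L, 3L, 4L)
)

# Grouping by ID only pools replicates across buffers,
# so uniqueN() sees all four of 1,2,3,4
pooled <- toy[, .(n_techn = uniqueN(Repl_Techn)), by = ID]
pooled$n_techn  # 4

# Grouping by ID and Buffer does not: each buffer alone covers only three
per_buffer <- toy[, .(n_techn = uniqueN(Repl_Techn)), by = .(ID, Buffer)]
per_buffer$n_techn  # 3 3
```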

7.2 Statistics

stats_df_bio <- dataset_2step_plus_in_3_of_4 %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Repl_Techn) %>%
  summarize(
    #MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
    MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
    SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
    CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
    .groups = 'drop') %>%
  dplyr::rename(`Replicate` = `Repl_Techn`) %>%
  mutate(`Variance_Type` = "Biological Variance")

stats_df_tech <- dataset_2step_plus_in_3_of_4 %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Repl_Biol) %>%
  summarize(
   # MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
    MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
    SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
    CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
    .groups = 'drop') %>%
  dplyr::rename(`Replicate` = `Repl_Biol`) %>%
  mutate(`Variance_Type` = "Technical Variance")

combined_variance_stats <- bind_rows(stats_df_tech, stats_df_bio)
# Summarize the dataset
mean_variance_stats <- combined_variance_stats %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Variance_Type) %>%
  summarize(
    #MedianLFQ = mean(MedianLFQ, na.rm = TRUE),
    MeanLFQ = mean(MeanLFQ, na.rm = TRUE),
    SDLFQ = mean(SDLFQ, na.rm = TRUE),
    CVLFQ = mean(CVLFQ, na.rm = TRUE)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'ID', 'Protein.Group', 'Genes',
## 'First.Protein.Description', 'Method', 'Buffer_name'. You can override using
## the `.groups` argument.

7.2.1 Calculate

First, we calculate summary statistics for each sample type: the mean LFQ intensity, its standard deviation, and the coefficient of variation (CV). Additionally, we extract the intensities for each biological replicate.
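As a quick sanity check, the CV formula used below (sd divided by mean, times 100) behaves as expected on toy values:

```r
# Toy intensities (hypothetical values): mean 100, sd 10, so CV = 10%
x <- c(90, 100, 110)

mean_x <- mean(x, na.rm = TRUE)   # 100
sd_x   <- sd(x, na.rm = TRUE)     # 10
cv_x   <- sd_x / mean_x * 100     # 10

c(MeanLFQ = mean_x, SDLFQ = sd_x, CVLFQ = cv_x)
```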

# Calculate statistics for each sample type
stats_df_2plus <- dataset_2step_plus_in_3_of_4 %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, Method, Buffer_name, Repl_Techn) %>%
  summarize(
    #MedianLFQ = median(maxLfQ_intensities, na.rm = TRUE),
    MeanLFQ = mean(maxLfQ_intensities, na.rm = TRUE),
    SDLFQ = sd(maxLfQ_intensities, na.rm = TRUE),
    CVLFQ = (sd(maxLfQ_intensities, na.rm = TRUE) / mean(maxLfQ_intensities, na.rm = TRUE)) * 100,
    .groups = 'drop')# %>%
  #dplyr::rename(`Replicate` = `Repl_Techn`)

# Calculate the intensities for each biological replicate
replicate_intensities <- dataset_2step_plus %>%
  group_by(ID, Protein.Group, Genes, First.Protein.Description, M_value, Buffer_name, Repl_Techn, Repl_Biol) %>%
  summarize(Intensity = mean(maxLfQ_intensities, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Tech = paste0("T", Repl_Techn)) %>%
  mutate(B = paste0("B", Repl_Biol)) %>%
  unite("T_B_Repl", Tech, Buffer_name, B, sep = "-") %>%
  select(-Repl_Biol, -Repl_Techn) %>%
  pivot_wider(names_from = T_B_Repl, values_from = Intensity)


# Add a new column M-B to stats_df using the mapping
stats_df_2plus_long <- stats_df_2plus %>%
  mutate(T_B = paste0("T", Repl_Techn, "-", Buffer_name)) %>%
  select(-Repl_Techn, -Buffer_name, -Method) %>%
  pivot_longer(cols = c(MeanLFQ, SDLFQ, CVLFQ), names_to = "Metric", values_to = "Value") %>%
  unite("T_B_Metric", T_B, Metric, sep = "_") %>%
  pivot_wider(names_from = T_B_Metric, values_from = Value) #%>%
  #relocate(Occurrence, .after = First.Protein.Description) # Moves Occurrence to the 5th position

# Merge with replicate intensities and arrange columns
final_df_2plus <- stats_df_2plus_long %>%
  right_join(replicate_intensities, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description")) %>%
  arrange(ID)

7.2.2 Sort

Next, we dynamically generate the column order so that, for each Method x Buffer combination, the four replicate intensity columns come first, followed by the summary metric columns.

# Filter to only include Repl_Techn == 1
ordered_meta_copy <- ordered_meta_data %>%
  #filter(Repl_Techn == 1) %>%
  filter(Method == "2step+M") %>%
  arrange(Repl_Techn, Buffer_name, Repl_Biol)

# Function to generate columns for each Method x Buffer combination
generate_columns <- function(buffer, t_value) {
  replicate_cols <- paste0("T", t_value, "-", buffer, "-B", 1:4)
  metric_cols <- paste0("T", t_value, "-", buffer, "_", c("MeanLFQ", "SDLFQ", "CVLFQ"))
  c(replicate_cols, metric_cols)
}
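For reference, a standalone copy of the helper shows the column layout it produces for one Method x Buffer combination (four replicate columns followed by the three metric columns):

```r
# Same logic as generate_columns() above, repeated so the example is self-contained
generate_columns <- function(buffer, t_value) {
  replicate_cols <- paste0("T", t_value, "-", buffer, "-B", 1:4)
  metric_cols <- paste0("T", t_value, "-", buffer, "_", c("MeanLFQ", "SDLFQ", "CVLFQ"))
  c(replicate_cols, metric_cols)
}

generate_columns("AS", 1)
## "T1-AS-B1" "T1-AS-B2" "T1-AS-B3" "T1-AS-B4"
## "T1-AS_MeanLFQ" "T1-AS_SDLFQ" "T1-AS_CVLFQ"
```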

method_buffer_combinations <- ordered_meta_copy %>%
  distinct(Repl_Techn, Buffer_name, M_value)

# Generate the column order dynamically
all_cols <- unlist(lapply(1:nrow(method_buffer_combinations), function(i) {
  generate_columns(method_buffer_combinations$Buffer_name[i], 
                   method_buffer_combinations$Repl_Techn[i])
}))

# Ensure that all_cols contains only the columns present in final_df
all_cols <- c("ID", "Protein.Group", "Genes", "First.Protein.Description", all_cols[all_cols %in% names(final_df_2plus)])

# Select columns in the specified order
final_df_2plus <- final_df_2plus %>%
  select(all_of(all_cols))

7.2.3 Save

final_attributes_2plus_df <- attributes_df %>% select(-Occurrence) %>%
  inner_join(final_df_2plus, by = c("ID", "Protein.Group", "Genes", "First.Protein.Description"))


write_xlsx(final_attributes_2plus_df, file.path(final_out_dir, "final_2plus_attributes_df.xlsx"))

7.3 Visualizations

7.3.1 Upset

binary_presence <- dataset_2step_plus_in_3_of_4 %>%
  distinct(ID, Repl_Techn) %>%
  pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn, 
              values_fill = list(Repl_Techn = 0), 
              values_fn = list(Repl_Techn = function(x) 1))  %>%
  tibble::column_to_rownames(var = "ID")

binary_presence$all <- 1

binary_presence <- binary_presence[, c(4,3,2,1,5)]
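The one-hot encoding produced by `pivot_wider()` above can be illustrated on a toy presence table (hypothetical IDs, same `values_fill`/`values_fn` arguments):

```r
library(tidyr)

# Toy presence data: p1 seen in technical replicates 1 and 2, p2 only in 2
d <- data.frame(ID = c("p1", "p1", "p2"), Repl_Techn = c(1, 2, 2))

# Value 1 where the ID was observed, 0 where it was not
binary_toy <- pivot_wider(d,
  names_from  = Repl_Techn,
  values_from = Repl_Techn,
  values_fill = list(Repl_Techn = 0),
  values_fn   = list(Repl_Techn = function(x) 1)
)
binary_toy
## ID  `1` `2`
## p1    1   1
## p2    0   1
```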

upset_plot <- ComplexUpset::upset(data=binary_presence, 
                    intersect=names(binary_presence), name="", 
                     wrap=TRUE,
                    themes=upset_modify_themes(list('intersections_matrix'=theme(axis.text.y=element_text(size=12)))),
                    base_annotations=list(
                        'No. of\nintersecting\nproteins'=intersection_size(counts=TRUE, bar_number_threshold = 1) + 
                          expand_limits(y = 4000) +
                           theme(
                              #panel.grid.major = element_blank(),    # Remove major gridlines
                              panel.grid.minor = element_blank(),    # Remove minor gridlines
                             axis.text.y = element_text(size = 11),  # Increase y-axis annotation size
                             axis.title.y = element_text(size = 13))),  # Increase y-axis title size),
                    set_sizes=(
                        upset_set_size()
                        + geom_text(aes(label=..count..), hjust=1.1, stat='count', size = 3.5)
                        + annotate(geom='text', label='@', x='Count', y=850, color='white', size=4)
                        + expand_limits(y=7000) + 
                          theme(axis.title.x = element_text(size = 13), axis.text.x = element_text(size = 10))
                    ),
                    sort_sets=FALSE,
                    queries=list(
                        upset_query(
                            intersect = names(binary_presence),
                            color='darkred',
                            fill='darkred',
                            only_components=c('intersections_matrix', 'No. of\nintersecting\nproteins')
                        ), 
                        upset_query(set = c("all"), fill = "orange")
                    ),
                    sort_intersections_by='ratio') + theme(text = element_text(size = 14), axis.title.x = element_text(size = 12))
  


ggsave(file.path(final_out_dir, "Technical_Repl_Intersection.png"), plot = upset_plot, width = 10, height = 6, bg = 'white')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
print(upset_plot)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

binary_presence <- dataset_2step_plus_in_3_of_4 %>%
  filter(Buffer_name == "AS") %>%
  distinct(ID, Repl_Techn) %>%
  pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn, 
              values_fill = list(Repl_Techn = 0), 
              values_fn = list(Repl_Techn = function(x) 1))  %>%
  tibble::column_to_rownames(var = "ID")

binary_presence$all <- 1

binary_presence <- binary_presence[, c(4,3,2,1,5)]

upset_plotAS <- ComplexUpset::upset(data=binary_presence, 
                    intersect=names(binary_presence), name="Technical Replicates", 
                     wrap=TRUE,
                    base_annotations=list(
                        'Intersection size'=intersection_size(
                            text=list(size = 2.5))),
                    set_sizes=(
                        upset_set_size()
                        + geom_text(aes(label=..count..), hjust=1.1, stat='count')
                        + annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
                        + expand_limits(y=7000)
                    ),
                    sort_sets=FALSE,
                    queries=list(
                        upset_query(
                            intersect = names(binary_presence),
                            color='palegreen3',
                            fill='palegreen3',
                            only_components=c('intersections_matrix', 'Intersection size')
                        ), 
                        upset_query(set = c("all"), fill = "orange")
                    ),
                    sort_intersections_by='ratio') + ggtitle('Intersection of Quantified Proteins in 2Step+ in AS')

#ggsave(file.path(final_out_dir, "technical_repl_intersection.png"), plot = upset_plot, width = 10, height = 6, bg = 'white')

print(upset_plotAS)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

binary_presence <- dataset_2step_plus_in_3_of_4 %>%
  filter(Buffer_name == "AI") %>%
  distinct(ID, Repl_Techn) %>%
  pivot_wider(names_from = Repl_Techn, values_from = Repl_Techn, 
              values_fill = list(Repl_Techn = 0), 
              values_fn = list(Repl_Techn = function(x) 1))  %>%
  tibble::column_to_rownames(var = "ID")

binary_presence$all <- 1

binary_presence <- binary_presence[, c(4,3,2,1,5)]

upset_plotAI <- ComplexUpset::upset(data=binary_presence, 
                    intersect=names(binary_presence), name="Technical Replicates", 
                     wrap=TRUE,
                    base_annotations=list(
                        'Intersection size'=intersection_size(
                            text=list(size = 2.5))),
                    set_sizes=(
                        upset_set_size()
                        + geom_text(aes(label=..count..), hjust=1.1, stat='count')
                        + annotate(geom='text', label='@', x='Count', y=850, color='white', size=3)
                        + expand_limits(y=7000)
                    ),
                    sort_sets=FALSE,
                    queries=list(
                        upset_query(
                            intersect = names(binary_presence),
                            color='royalblue',
                            fill='royalblue',
                            only_components=c('intersections_matrix', 'Intersection size')
                        ), 
                        upset_query(set = c("all"), fill = "orange")
                    ),
                    sort_intersections_by='ratio') + ggtitle('Intersection of Quantified Proteins in 2Step+ in AI')

#ggsave(file.path(final_out_dir, "technical_repl_intersection.png"), plot = upset_plot, width = 10, height = 6, bg = 'white')

print(upset_plotAI)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

(upset_plot) / (upset_plotAS | upset_plotAI )  + plot_annotation(tag_levels = 'A')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

7.4 Boxplots

# Define a function to create and save plots
create_and_save_plot <- function(df, y_var, log_transform = FALSE, sub_dir, filename) {
  p <- ggplot(df, aes(x = as.factor(Replicate), y = if (log_transform) log10(!!sym(y_var)) else !!sym(y_var), fill = Buffer_name)) +
    geom_boxplot() +
    facet_grid( Buffer_name ~ Variance_Type, scales = "free_x") +
    labs(title = paste("Biological and Technical Variance in", y_var),
         x = "Replicate", y = if (log_transform) paste("Log10(", y_var, ")", sep = "") else y_var) +
    scale_fill_manual(values = c("AI" = "royalblue", "AS" = "palegreen3")) +
    theme_minimal()
  
  # Print the plot
  print(p)
  
  # Save the plot
  ggsave(file.path(sub_dir, filename), plot = p, width = 12, height = 8, bg = 'white')
  
  return(p)
}


# Plot for MeanLFQ with log10 transformation
meanlfq <- create_and_save_plot(combined_variance_stats, "MeanLFQ", log_transform = TRUE, final_out_dir, "Variance_MeanLFQ_Boxplot.png")

# Plot for SDLFQ with log10 transformation
sdlfq <- create_and_save_plot(combined_variance_stats, "SDLFQ", log_transform = TRUE, final_out_dir, "Variance_SDLFQ_Boxplot.png")
## Warning: Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Plot for CVLFQ without log10 transformation
cvlfq <- create_and_save_plot(combined_variance_stats, "CVLFQ", log_transform = FALSE, final_out_dir, "Variance_CVLFQ_Boxplot.png")
## Warning: Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 2788 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:VennDiagram':
## 
##     rotate
## The following object is masked from 'package:cowplot':
## 
##     get_legend
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:IRanges':
## 
##     desc
## The following object is masked from 'package:stats':
## 
##     filter
# Define a function to create and save summary plots
create_and_save_summary_plot <- function(df, y_var, log_transform = FALSE, sub_dir, filename) {
  # Modify the variance type for visualization
  df_mut <- df %>% mutate(Variance_Type = str_replace_all(Variance_Type, " ", "\n"))
  
  # Perform the t-test for each Buffer_name group
  stat.test <- df_mut %>%
    group_by(Buffer_name) %>%
    t_test(CVLFQ ~ Variance_Type) %>%
    add_significance() # %>%
  #  mutate(
   #   effsize = round(effsize, 2),  # Round effect size to 2 decimal places
   #   label = paste(effsize, " (", magnitude, ")", sep = "")  # Format label with magnitude
   # )

  # Create the plot
  p <-  ggpubr::ggboxplot(df_mut, x = "Variance_Type", y = y_var, fill = "Buffer_name",
                 palette = c("AI" = "royalblue", "AS" = "palegreen3")) +
    facet_wrap(~ Buffer_name) +
    labs(title = "", fill = "Buffer Type",
         x = "", y = if (log_transform) paste("Log10(", y_var, ")", sep = "") else "Coefficient of Variation in % (maxLFQ)") +
    theme_minimal() + 
    theme(text = element_text(size = 14)) +
    scale_x_discrete(labels = named_labels)

  # Add t-test results as text annotations
  p <- p + ggpubr::stat_pvalue_manual(stat.test, label = "{p.signif}", 
                     y.position = max(df$CVLFQ, na.rm = TRUE) * 1.05, 
                     size = 4, color = "black")

  # Print the plot
  print(p)

  # Save the plot
  ggsave(file.path(sub_dir, filename), plot = p, width = 9, height = 6, bg = 'white')
  
  return(p)
}

# Generate the summary plot for CVLFQ without log10 transformation
summary_cvlfq <- create_and_save_summary_plot(mean_variance_stats, "CVLFQ", log_transform = FALSE, final_out_dir, "Summary_CVLFQ_Boxplot.png")
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

7.5 Final

# Combining the plots
# Combine widths and guides in a single plot_layout() call; a second call
# would override the first, discarding the width specification
combined_plot_2stepplus <- (summary_cvlfq | upset_plot) + plot_annotation(tag_levels = 'A') +
  plot_layout(widths = c(1, 2), guides = "collect") & theme(legend.position = "bottom")


# Save the combined figure
ggsave(file.path(final_out_dir, "2stepplus.png"), plot = combined_plot_2stepplus, width = 14, height = 6, dpi = 200)
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).
print(combined_plot_2stepplus)
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_text()`).

8 Check Collagens

dataset_filtered_in_3_of_4_col <- dataset_filtered_in_3_of_4 %>%
  filter(str_detect(Genes, "Col\\d"))
library(dplyr)
library(ggplot2)

# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4_col
mean_intensities_col <- dataset_filtered_in_3_of_4_col %>%
  group_by(Method, Buffer_name, ID) %>%
  summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name'. You can override
## using the `.groups` argument.
# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4
mean_intensities_all <- dataset_filtered_in_3_of_4 %>%
  group_by(Method, Buffer_name, ID) %>%
  summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name'. You can override
## using the `.groups` argument.
# Count the total number of IDs per Method x Buffer_name
total_ids_all <- mean_intensities_all %>%
  group_by(Method, Buffer_name) %>%
  summarize(total_ids = n())
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Count the total number of IDs in _col per Method x Buffer_name
total_ids_col <- mean_intensities_col %>%
  group_by(Method, Buffer_name) %>%
  summarize(total_ids_col = n())
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Merge the counts
total_ids <- total_ids_all %>%
  left_join(total_ids_col, by = c("Method", "Buffer_name")) %>%
  replace_na(list(total_ids_col = 0))

# Barplot
ggplot(total_ids, aes(x = Buffer_name, y = total_ids, fill = "All IDs")) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_bar(aes(y = total_ids_col, fill = "Collagen IDs"), stat = "identity", position = "dodge") +
  geom_text(aes(y = total_ids, label = total_ids), vjust = -0.5, position = position_dodge(width = 0.9)) +
  geom_text(aes(y = total_ids_col, label = total_ids_col), vjust = -0.5, position = position_dodge(width = 0.9), color = "white") +
  labs(title = "Total Number of IDs per Method x Buffer_name",
       x = "Buffer",
       y = "Number of IDs") +
  scale_fill_manual(name = "ID Type", values = c("All IDs" = "blue", "Collagen IDs" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10), 
        panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  facet_grid(~ Method, scales = "free", space = "free") +
  ylim(0,5000)

# Add a column to indicate the dataset source
mean_intensities_all <- mean_intensities_all %>% mutate(Source = "All")
mean_intensities_col <- mean_intensities_col %>% mutate(Source = "Collagen")

# Combine the datasets
combined_intensities <- bind_rows(mean_intensities_all, mean_intensities_col)

# Boxplot
ggplot(combined_intensities, aes(x = Buffer_name, y = log10(mean_intensity), fill = Source)) +
  geom_boxplot() +
  labs(title = "Intensities for All and Collagen IDs per Method x Buffer_name",
       x = "Method x Buffer_name",
       y = "Log10(Intensity)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10), 
        panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  scale_fill_manual(name = "ID Type", values = c("All" = "blue", "Collagen" = "red")) +
  facet_grid(~ Method, scales = "free", space = "free") 

# Sum of intensities for all IDs
sum_intensities_all <- mean_intensities_all %>%
  group_by(Method, Buffer_name) %>%
  summarize(sum_intensity_all = sum(mean_intensity))
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Sum of intensities for collagen IDs
sum_intensities_col <- mean_intensities_col %>%
  group_by(Method, Buffer_name) %>%
  summarize(sum_intensity_col = sum(mean_intensity))
## `summarise()` has grouped output by 'Method'. You can override using the
## `.groups` argument.
# Merge the sums
sum_intensities <- sum_intensities_all %>%
  left_join(sum_intensities_col, by = c("Method", "Buffer_name")) %>%
  replace_na(list(sum_intensity_col = 0))

# Calculate the percentage of intensities in _col
sum_intensities <- sum_intensities %>%
  mutate(percentage_col = (sum_intensity_col / sum_intensity_all) * 100)

# Barplot for sum of intensities
ggplot(sum_intensities, aes(x = Buffer_name, y = sum_intensity_all, fill = "All Intensities")) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_bar(aes(y = sum_intensity_col, fill = "Collagen Intensities"), stat = "identity", position = "dodge") +
  labs(title = "Sum of Intensities per Method x Buffer_name",
       x = "Buffer",
       y = "Sum of Intensities") +
  scale_fill_manual(name = "Intensity Type", values = c("All Intensities" = "blue", "Collagen Intensities" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10), 
        panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  facet_grid(~ Method, scales = "free", space = "free") 

# Print the percentages
print(sum_intensities)
## # A tibble: 10 × 5
## # Groups:   Method [5]
##    Method  Buffer_name sum_intensity_all sum_intensity_col percentage_col
##    <chr>   <fct>                   <dbl>             <dbl>          <dbl>
##  1 1step   single             310178429.         23301675.           7.51
##  2 1step+C single             141478011.          4264569.           3.01
##  3 2step   AS                 202308694.          9666053.           4.78
##  4 2step   AI                 165454203.         11862848.           7.17
##  5 2step+M AS                 221498463.         21260730.           9.60
##  6 2step+M AI                 176002435.         13063953.           7.42
##  7 4step   AS1                211851742.          9326736.           4.40
##  8 4step   AI1                175606308.         10654546.           6.07
##  9 4step   AI2                153282257.         20497227.          13.4 
## 10 4step   AS2                254838181.        144248511.          56.6

# Calculate the percentage of intensities in _col
sum_intensities <- sum_intensities %>%
  mutate(percentage_col = (sum_intensity_col / sum_intensity_all) * 100,
         percentage_all = 100)

# Barplot for sum of intensities
p <- ggplot(sum_intensities, aes(x = Buffer_name, y = percentage_all, fill = "All Intensities")) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_bar(aes(y = percentage_col, fill = "Collagen Intensities"), stat = "identity", position = "dodge") +
  geom_text(aes(y = percentage_col, label = sprintf("%.2f%%", percentage_col)), vjust = -0.5, position = position_dodge(width = 0.9), color = "white") +
  labs(title = "Percentage of Intensities per Method x Buffer_name",
       x = "Buffer",
       y = "Percentage of Intensities") +
  scale_fill_manual(name = "Intensity Type", values = c("All Intensities" = "blue", "Collagen Intensities" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0), strip.text.x = element_text(size = 10), 
        panel.spacing = unit(0.1, "lines"), panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  facet_grid(~ Method, scales = "free", space = "free") 

# Print the percentages
print(sum_intensities)
## # A tibble: 10 × 6
## # Groups:   Method [5]
##    Method  Buffer_name sum_intensity_all sum_intensity_col percentage_col
##    <chr>   <fct>                   <dbl>             <dbl>          <dbl>
##  1 1step   single             310178429.         23301675.           7.51
##  2 1step+C single             141478011.          4264569.           3.01
##  3 2step   AS                 202308694.          9666053.           4.78
##  4 2step   AI                 165454203.         11862848.           7.17
##  5 2step+M AS                 221498463.         21260730.           9.60
##  6 2step+M AI                 176002435.         13063953.           7.42
##  7 4step   AS1                211851742.          9326736.           4.40
##  8 4step   AI1                175606308.         10654546.           6.07
##  9 4step   AI2                153282257.         20497227.          13.4 
## 10 4step   AS2                254838181.        144248511.          56.6 
## # ℹ 1 more variable: percentage_all <dbl>
print(p)

9 Rank Groups

# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4_col
mean_intensities_col <- dataset_filtered_in_3_of_4_col %>%
  group_by(Method, Buffer_name, ID, Genes) %>%
  summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'Method', 'Buffer_name', 'ID'. You can
## override using the `.groups` argument.
# Calculate mean maxLfQ_intensities for dataset_filtered_in_3_of_4
mean_intensities_all <- dataset_filtered_in_3_of_4 %>%
  group_by(Method, Buffer_name, ID, Genes) %>%
  summarize(mean_intensity = mean(maxLfQ_intensities, na.rm = TRUE), .groups = "drop")
# Rank the data by mean_intensity for each Method x Buffer_name combination
ranked_data <- mean_intensities_all %>%
  group_by(Method, Buffer_name) %>%
  mutate(rank = rank(-mean_intensity, ties.method = "first")) %>%
  ungroup()
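A quick standalone illustration (toy values, not from the dataset) of how negating the vector plus `ties.method = "first"` yields a descending rank with ties broken by position:

```r
# rank() ranks ascending, so negate the values to rank descending;
# "first" assigns tied values increasing ranks in order of appearance
x <- c(5, 3, 5, 1)
rank(-x, ties.method = "first")
# 1 3 2 4: the first 5 gets rank 1, the second 5 gets rank 2
```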

ranked_data <- ranked_data %>%
  mutate(Buffer_name = factor(Buffer_name, levels = c("single", "AS", "AI", "AS1", "AI1", "AI2", "AS2"))) %>%
  mutate(Buffer_name = recode(Buffer_name, "single" = "Single"))
# Create the plot

p <- ggplot(ranked_data, aes(x = rank, y = log10(mean_intensity), color = Buffer_name)) +
  geom_point(size = 0.5) +
  labs(title = "Log10 Mean Intensity vs Rank by Method",
       x = "Rank",
       y = "Log10 Mean Intensity",
       color = "Buffer Name") +
  facet_wrap(~ Method, ncol = 2) + #, scales = "free_x") +  
  theme_bw() +
  theme(strip.text = element_text(size = 10), 
        panel.spacing = unit(0.5, "lines"), 
        panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  guides(color = guide_legend(override.aes = list(size = 5))) 

if (!is.null(final_out_dir)) ggsave(file.path(final_out_dir, "Supplement1.png"), plot = p, width = 8, height = 4)

print(p)

library(purrr)
ranked_data_col <- ranked_data %>% filter(str_detect(Genes, "Col\\d"))

# Create the table, sorted by Genes
ranked_table <- ranked_data_col %>%
  group_by(Method, Buffer_name) %>%
  arrange(rank, .by_group = TRUE) %>%
  select(Method, Buffer_name, Rank = rank, ID, Genes, mean_intensity) %>%
  ungroup()

# Split the dataframe into a list of dataframes, one for each Method_Buffer_name combination
ranked_table_list <- ranked_table %>%
  split(list(.$Method, .$Buffer_name))

# Save each sub-dataframe as a CSV file
purrr::walk2(names(ranked_table_list), ranked_table_list, function(name, data) {
  if(nrow(data) > 0) {  # Check if the dataframe is not empty
    filename <- paste0(gsub(" ", "_", name), ".csv")
    # Write next to the other results; skipped when final_out_dir is NULL
    if (!is.null(final_out_dir)) write.csv(data, file.path(final_out_dir, filename), row.names = FALSE)
  }
})
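The `nrow(data) > 0` guard is needed because `split()` on two grouping columns produces one element per Method × Buffer_name combination, including combinations that never occur in the data, with names joined by a dot. A minimal sketch with toy columns:

```r
# Two rows, but 2 x 2 = 4 possible Method/Buffer combinations
toy <- data.frame(Method      = c("1step", "2step"),
                  Buffer_name = c("single", "AS"))
parts <- split(toy, list(toy$Method, toy$Buffer_name))
length(parts)                  # 4 elements, named e.g. "1step.single"
sum(sapply(parts, nrow) == 0)  # 2 empty combinations that would yield empty CSVs
```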
# Function to get the rows with the highest and lowest mean intensity for each group
get_extreme_rows <- function(df) {
  df %>%
    filter(mean_intensity == max(mean_intensity) | mean_intensity == min(mean_intensity))
}

# Apply the function to each Method x Buffer_name group
filtered_data <- ranked_data %>%
  group_by(Method, Buffer_name) %>%
  do(get_extreme_rows(.))

# Ungroup the data for cleaner output
filtered_data <- filtered_data %>% ungroup()
filtered_data$log10 <- log10(filtered_data$mean_intensity)
# Plot the data
ggplot(ranked_data, aes(x = rank, y = log10(mean_intensity), color = Buffer_name)) +
  geom_point() +
  labs(title = "Log10 Mean Intensity vs Rank by Method",
       x = "Rank",
       y = "Log10 Mean Intensity",
       color = "Buffer Name") +
  facet_wrap(~ Method, ncol = 2) +
  theme_minimal() +
  theme(strip.text = element_text(size = 10), 
        panel.spacing = unit(0.5, "lines"), 
        panel.border = element_rect(color = "black", fill = NA, linewidth = 1))
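Note that `do()` is superseded in recent dplyr versions; the per-group application of `get_extreme_rows()` above can equivalently be written with `group_modify()`. A sketch on toy data (assuming dplyr >= 1.0):

```r
library(dplyr)

toy <- data.frame(g = c("a", "a", "a", "b", "b"),
                  v = c(1, 9, 5, 4, 2))

# Keep the row(s) with the minimum and maximum v within each group
extremes <- toy %>%
  group_by(g) %>%
  group_modify(~ filter(.x, v == max(v) | v == min(v))) %>%
  ungroup()

nrow(extremes)  # 4: the min and max row of each group
```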

10 Supplement

10.1 Scatter

library(ggpointdensity)
library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
## 
##     viridis_pal
# Function to create scatterplot with correlation
create_scatterplot_with_correlation <- function(df, method_label, log_scale = FALSE) {
  # Step 1: Filter the dataset for the specified method
  filtered_df <- df %>%
    filter(Method == method_label)

  # Step 2: Identify shared IDs
  shared_ids <- filtered_df %>%
    group_by(ID) %>%
    filter(n_distinct(Buffer_name) > 1) %>%
    pull(ID) %>%
    unique()

  # Step 3: Subset the data for shared IDs
  shared_data <- filtered_df %>%
    filter(ID %in% shared_ids)

  # Step 4: Reshape the data to have one row per ID with columns for each buffer
  shared_data_wide <- shared_data %>%
    select(-SDLFQ, -CVLFQ) %>%
    pivot_wider(names_from = Buffer_name, values_from = MeanLFQ, names_prefix = "MeanLFQ_")
  
  # Step 5: Apply log scale if flag is set
  if (log_scale) {
    shared_data_wide <- shared_data_wide %>%
      mutate(across(starts_with("MeanLFQ_"), log2))
  }

  # Step 6: Calculate the correlation
  correlation <- cor(shared_data_wide$MeanLFQ_AS, shared_data_wide$MeanLFQ_AI, use = "complete.obs")

  # Step 7: Create the scatterplot
  p <- ggplot(shared_data_wide, aes(x = MeanLFQ_AS, y = MeanLFQ_AI)) +
    #geom_point(color = "#0072B2") +
    geom_pointdensity(size = 0.5) + scale_color_viridis(name = "Density") +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
    labs(title = paste("Scatterplot of Shared IDs between\nAS and AI Buffers in", method_label, "Method"),
         x = if (log_scale) "Log2 MeanLFQ Intensity in AS Buffer" else "MeanLFQ Intensity in AS Buffer",
         y = if (log_scale) "Log2 MeanLFQ Intensity in AI Buffer" else "MeanLFQ Intensity in AI Buffer") +
    theme_minimal(base_size = 14) +  # Adjust base size for smaller text
      coord_fixed(xlim = c(10, 25), ylim = c(10, 25)) +  # Fixed x and y axis limits
      scale_x_continuous(breaks = seq(10, 25, by = 5)) +
      scale_y_continuous(breaks = seq(10, 25, by = 5)) +
      annotate("text", x = Inf, y = -Inf, label = paste("Corr.:", round(correlation, 2)), 
               hjust = 1.1, vjust = -1.5, size = 4, color = "black")
  
  print(p)
}
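The `use = "complete.obs"` argument matters here because proteins quantified in only one buffer leave `NA` in the wide table; with it, `cor()` drops any pair containing a missing value instead of returning `NA`. A standalone example:

```r
a <- c(1, 2, 3, 4, NA)
b <- c(2, 4, 6, NA, 10)
cor(a, b)                        # NA: missing values propagate by default
cor(a, b, use = "complete.obs")  # 1: Pearson correlation over the 3 complete pairs
```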

# Create scatterplot for "2step" ("2step+M" is handled per replicate below)
create_scatterplot_with_correlation(stats_df, "2step", log_scale = TRUE)

#create_scatterplot_with_correlation(stats_df, "2step+M", log_scale = T)

The same comparison is repeated for the 2step+M method, this time separated by technical replicate, to assess technical variance.

library(ggpointdensity)
library(viridis)
library(patchwork)  # To arrange plots

# Function to create scatterplot with correlation for multiple replicates
create_scatterplot_with_correlation_multiple_replicates <- function(df, method_label, log_scale = FALSE) {
  # Step 1: Filter the dataset for the specified method
  filtered_df <- df %>%
    filter(Method == method_label)
  
  filtered_df$Replicate <- factor(filtered_df$Replicate, levels = c(1, 2, 3, 4))

  # Step 2: Identify shared IDs for each replicate
  shared_ids <- filtered_df %>%
    group_by(ID) %>%
    filter(n_distinct(Buffer_name) > 1) %>%
    pull(ID) %>%
    unique()

  # Step 3: Subset the data for shared IDs
  shared_data <- filtered_df %>%
    filter(ID %in% shared_ids)

  # Step 4: Reshape the data to have one row per ID with columns for each buffer
  shared_data_wide <- shared_data %>%
    select(-SDLFQ, -CVLFQ) %>%
    pivot_wider(names_from = Buffer_name, values_from = MeanLFQ, names_prefix = "MeanLFQ_")
  
  # Step 5: Apply log scale if flag is set
  if (log_scale) {
    shared_data_wide <- shared_data_wide %>%
      mutate(across(starts_with("MeanLFQ_"), log2))
  }

  # Step 6: Calculate correlation for each replicate and add as a new column
  correlation_df <- shared_data_wide %>%
    group_by(Replicate) %>%
    summarize(corr_value = round(cor(MeanLFQ_AS, MeanLFQ_AI, use = "complete.obs"), 2))
  
  # Merge correlation data back to the main data frame for annotation
  shared_data_wide <- left_join(shared_data_wide, correlation_df, by = "Replicate")

  # Step 7: Create a scatterplot with facets for each replicate
  p <- ggplot(shared_data_wide, aes(x = MeanLFQ_AS, y = MeanLFQ_AI)) +
    geom_pointdensity(size = 0.3) +
    scale_color_viridis(name = "Density") +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
    labs(title = paste("Scatterplot of Shared IDs between\nAS and AI Buffers in", method_label, "Method"),
         x = if (log_scale) "Log2 MeanLFQ Intensity in AS Buffer" else "MeanLFQ Intensity in AS Buffer",
         y = if (log_scale) "Log2 MeanLFQ Intensity in AI Buffer" else "MeanLFQ Intensity in AI Buffer") +
    theme_minimal(base_size = 14) +  # Adjust base size for smaller text
    coord_fixed(xlim = c(10, 25), ylim = c(10, 25)) +  # Fixed x and y axis limits
    scale_x_continuous(breaks = seq(10, 25, by = 5)) +
    scale_y_continuous(breaks = seq(10, 25, by = 5)) +
    facet_wrap(~ Replicate, ncol = 2, labeller = labeller(Replicate = function(x) paste("Technical Replicate", x))) +
    geom_text(data = correlation_df, aes(x = 25, y = 11, label = paste("Corr.:", corr_value)), 
              size = 4, color = "black", hjust = 1, vjust = 1)  # Position text at the bottom right
  
  print(p)
  return(p)
}


# Create scatterplots for "2step" from stats_df
scatter_2step <- create_scatterplot_with_correlation(stats_df, "2step", log_scale = TRUE)

# Create scatterplots for "2step+M" from stats_df_bio, separated by replicates
scatter_2stepplus <- create_scatterplot_with_correlation_multiple_replicates(stats_df_bio, "2step+M", log_scale = TRUE)
## Warning: Removed 1570 rows containing non-finite outside the scale range
## (`stat_pointdensity()`).

# Combine scatter_2step and scatter_2stepplus into one plot with tags A and B
combined_plot <- (scatter_2step | scatter_2stepplus) + 
                 #plot_layout(widths = c(0.49, 0.51)) + 
                 plot_annotation(tag_levels = 'A')

# Save the combined figure
if (!is.null(final_out_dir)) ggsave(file.path(final_out_dir, "Supplement2.png"), plot = combined_plot, width = 12, height = 9)
## Warning: Removed 1570 rows containing non-finite outside the scale range
## (`stat_pointdensity()`).
print(combined_plot)
## Warning: Removed 1570 rows containing non-finite outside the scale range
## (`stat_pointdensity()`).

10.2 Violin

The violin plots below compare physicochemical properties (pI, molecular weight, GRAVY score) averaged per Protein.Group.

library(scales)
plot_df <- merged_attributes_df_repl[merged_attributes_df_repl$Method %in% c("2step", "2step+M")]

# Function to plot each property
plot_property <- function(df, property, ylab, log_scale = FALSE) {
  p <- ggplot(df, aes(x = Buffer_name, y = get(property), fill = Method)) +
    geom_split_violin() +
    theme_minimal() +
    scale_fill_manual(values = c("#E69F00", "#56B4E9")) +
    theme(axis.title.x = element_blank(), legend.position = "none") +
    theme(text = element_text(size = 14),axis.text.x = element_text(size = 14)) 
  
  if (log_scale) {
    # Apply a single log10 y-scale (adding it twice replaces the first and
    # triggers a "Scale for y is already present" message);
    # `breaks`/`minor_breaks` are assumed to be defined earlier in the document
    p <- p + labs(title = "", x = "Buffer Type", y = ylab) +
      scale_y_log10(breaks = breaks, minor_breaks = minor_breaks, labels = label_number()) +
      annotation_logticks(sides = "l") +
      theme(
        text = element_text(size = 14),
        axis.text.x = element_text(size = 14),  # Larger x-axis labels
        axis.text.y = element_text(size = 10),
        panel.grid.minor.y = element_line(color = "grey", linetype = "dotted"),
        panel.grid.major.y = element_line(color = "grey", linetype = "solid")
      )
  } else {
    p <- p + labs(title = "", x = "Buffer Type", y = ylab)
  }
  
  return(p)
}

plot_df$mean_mw <- plot_df$mean_mw/1000
# Plot for pI
plot_pI <- plot_property(plot_df, "mean_pI", ylab = "Isoelectric point (pI)")

# Plot for mw with logarithmic scale
plot_mw <- plot_property(plot_df, "mean_mw", ylab = "Molecular weight [kDa]", log_scale = TRUE)
# Plot for gravy
plot_gravy <- plot_property(plot_df, "mean_gravy", ylab = "GRAVY score")
# Combine the plots using patchwork and add a global title
combined_plot <- (plot_pI | plot_mw | plot_gravy) + 
  plot_annotation(tag_levels = 'A') +
  plot_layout(guides = "collect") & 
   theme(
    legend.position = "bottom",
    plot.tag = element_text(size = 14)  # Adjust size as needed for tag consistency
  )

if (!is.null(final_out_dir)) ggsave(file.path(final_out_dir, "Supplement3.png"), plot = combined_plot, width = 8, height = 4)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
# Display the combined plot
print(combined_plot)
## Warning: Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Removed 110 rows containing non-finite outside the scale range
## (`stat_ydensity()`).